Introduction to Quantitative Finance, Part 1: Stylized Facts of Asset Returns in Python

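Original · Eryk Lewinson · 量化投资与机器学习 (Quantitative Investing and Machine Learning) · 2022-05-14

Author: Eryk Lewinson

Translator: Wally

Reproduction without authorization is strictly prohibited.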

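We want to present a simple allocation strategy and, hopefully, show that the basics of data science and quantitative finance are enough to beat the benchmark. Of course, there is no holy grail that works forever.

There is a theory that explains why this should not be possible: the Efficient Market Hypothesis (EMH). It states that asset prices fully reflect all available information. Since market prices react only to new information, it follows that beating the market consistently is impossible.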

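Loading libraries

Before jumping straight into machine learning and building asset allocation strategies, I believe it is crucial to spend some time on the basics and understand them, because they matter for the modeling that follows. In this article we look at the stylized facts of asset returns and show how to verify them with Python. Some basic knowledge of statistics will help, but we try to explain everything intuitively.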

# libraries ----
import pandas as pd 
import numpy as np
import quandl
import seaborn as sns

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import datetime as dt 

import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='matplotlib')

# settings ----
%matplotlib inline
sns.set_style('darkgrid')
sns.mpl.rcParams['figure.figsize'] = (10.0, 6.0)
sns.mpl.rcParams['savefig.dpi'] = 90
sns.mpl.rcParams['font.size'] = 14

# authentication ----
quandl_key = 'YOUR_API_KEY'  # paste your own Quandl API key here :)
quandl.ApiConfig.api_key = quandl_key

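Data preparation

In this article we use adjusted prices, because they account for corporate actions such as dividends.

I chose Microsoft (ticker: MSFT) as the example and downloaded the time series as a DataFrame. The prices are then converted to simple and log returns for further analysis: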

# downloading the data 
df = quandl.get('WIKI/MSFT', start_date="2000-01-01", end_date="2017-12-31")
df = df.loc[:, ['Adj. Close']]
df.columns = ['adj_close']

# create simple and log returns, multiplied by 100 for convenience
df['simple_rtn'] = 100 * df.adj_close.pct_change()
df['log_rtn'] = 100 * (np.log(df.adj_close) - np.log(df.adj_close.shift(1)))

# dropping NA's in the first row
df.dropna(how = 'any', inplace = True)

df.head()
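
The relation between the two return definitions computed above can be checked on a toy example. This is a minimal sketch with made-up prices (not MSFT data): it illustrates that a log return equals ln(1 + simple return), and that log returns add up over time, which is one reason they are convenient for analysis.

```python
import numpy as np
import pandas as pd

# hypothetical prices, made up for illustration only
prices = pd.Series([100.0, 102.0, 99.0, 101.0, 105.0])

simple_rtn = prices.pct_change().dropna()
log_rtn = np.log(prices / prices.shift(1)).dropna()

# a log return is ln(1 + simple return)
assert np.allclose(log_rtn, np.log1p(simple_rtn))

# log returns are additive: their sum is the log of the total growth factor
total_log_rtn = log_rtn.sum()
total_simple_rtn = prices.iloc[-1] / prices.iloc[0] - 1
print(total_log_rtn, np.log1p(total_simple_rtn))
```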

# Plotting the time series ----
fig, ax =plt.subplots(3, 1, figsize=(24, 20))
# price ----
df.adj_close.plot(ax=ax[0])
ax[0].set_ylabel('Stock price ($)')
ax[0].set_xlabel('')
ax[0].set_title('Price vs. returns')
# simple returns ----
df.simple_rtn.plot(ax=ax[1])
ax[1].set_ylabel('Simple returns (%)')
ax[1].set_xlabel('')
# log returns ----
df.log_rtn.plot(ax=ax[2])
ax[2].set_ylabel('Log returns (%)')
fig.show()

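One important feature is visible directly in the plots: periods of large returns alternate with periods of small returns, which suggests that volatility is not constant.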
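
One way to make this observation measurable is a rolling standard deviation. The sketch below is only an illustration on synthetic data with two artificial regimes; the 21-day window (roughly one trading month) is an arbitrary assumption, not something taken from the analysis above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# synthetic daily returns (in %): a calm regime followed by a turbulent one
calm = rng.normal(0.0, 1.0, 500)
turbulent = rng.normal(0.0, 3.0, 500)
rtn = pd.Series(np.concatenate([calm, turbulent]))

# rolling standard deviation over 21 observations (window length is an arbitrary choice)
rolling_vol = rtn.rolling(window=21).std()

print(rolling_vol.iloc[:500].mean(), rolling_vol.iloc[500:].mean())
```

On the real data the same idea would be `df.log_rtn.rolling(21).std()`; if volatility were constant, that line would stay roughly flat.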

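Stylized facts are, generally speaking, statistical properties that show up in many empirical asset return series (across time and across markets). It is important to be aware of them: a model that is supposed to represent asset price dynamics must be able to capture and replicate these properties.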

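1. The distribution of returns is not normal

Negative skewness (third moment): large negative returns occur more often than large positive ones. In the plot: a long left tail, with the mass of the distribution concentrated on the right side.

Excess kurtosis (fourth moment): large (and small) returns occur more often than expected under normality. In the plot: fat tails and a peaked distribution.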
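
To make the two moments concrete, here is a minimal sketch that computes them directly from their definitions. The sample is synthetic (Student's t with 5 degrees of freedom, chosen only because it has fat tails); it is not real return data.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(0)
# synthetic heavy-tailed sample, not real returns
x = rng.standard_t(df=5, size=100_000)

z = (x - x.mean()) / x.std()               # standardize with the population std (ddof=0)
skewness = np.mean(z ** 3)                 # third standardized moment
excess_kurtosis = np.mean(z ** 4) - 3.0    # fourth standardized moment minus 3 (the normal value)

# scipy's default (biased) estimators follow the same definitions
assert np.isclose(skewness, scs.skew(x))
assert np.isclose(excess_kurtosis, scs.kurtosis(x))
print(skewness, excess_kurtosis)
```

Note that pandas' `.skew()` and `.kurtosis()` apply a small-sample bias correction, so on the same data they give slightly different numbers than the plain moment formulas.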

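Below we show the histogram of the stock's returns together with the normal probability density curve. The returns do not display a noticeably higher peak (although that can certainly be the case), but there is clearly more mass in the tails than we would expect under normality.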
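
A plot of this kind can be produced roughly as follows. This is a sketch under assumptions: the data is a synthetic heavy-tailed stand-in for the log returns, and the normal curve uses the sample mean and standard deviation; with the real data you would pass `df.log_rtn` instead.

```python
import numpy as np
import scipy.stats as scs
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# synthetic stand-in for daily log returns in %, heavy-tailed by construction
rtn = 1.5 * rng.standard_t(df=4, size=5000)

mu, sigma = rtn.mean(), rtn.std()
grid = np.linspace(rtn.min(), rtn.max(), 500)
normal_pdf = scs.norm.pdf(grid, loc=mu, scale=sigma)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(rtn, bins=100, density=True, alpha=0.6, label="empirical")
ax.plot(grid, normal_pdf, color="red", label="N(mean, std)")
ax.set_xlabel("Log returns (%)")
ax.legend()

# share of observations beyond 3 standard deviations vs. the normal benchmark
tail_share = np.mean(np.abs(rtn - mu) > 3 * sigma)
normal_tail_share = 2 * scs.norm.sf(3)
print(tail_share, normal_tail_share)
```

Comparing the two tail shares puts a number on "more mass in the tails" without relying on the eye.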

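The red line represents the standard normal distribution. If the returns followed a Gaussian distribution, the two lines would coincide. We do see differences, however, mainly in the tails, which further supports the finding above.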
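
The comparison against the normal quantiles can also be done numerically. The sketch below uses `scipy.stats.probplot` on a synthetic heavy-tailed sample (an assumption for illustration, not the MSFT series) and measures how far the most extreme observations sit from the fitted reference line.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(2)
rtn = rng.standard_t(df=4, size=5000)   # synthetic heavy-tailed "returns"

# theoretical normal quantiles vs. ordered sample values, plus a least-squares reference line
(theoretical_q, ordered_rtn), (slope, intercept, r) = scs.probplot(rtn, dist="norm")
reference_line = slope * theoretical_q + intercept

# under normality the points hug the line; fat tails bend away from it at both ends
left_gap = ordered_rtn[0] - reference_line[0]      # most negative observation
right_gap = ordered_rtn[-1] - reference_line[-1]   # most positive observation
print(left_gap, right_gap)
```

Passing `plot=plt` to `probplot` draws the corresponding picture instead of only returning the numbers.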

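Finally, we look at the descriptive statistics of the returns. The Jarque-Bera normality test confirms our suspicion: the p-value is small enough to reject the null hypothesis that the data follow a Gaussian distribution.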

# Descriptive statistics ----
print('Range of dates:', min(df.index.date), '-', max(df.index.date))
print('Number of observations:', df.shape[0])
print('Mean: {0:.4f}'.format(df.log_rtn.mean()))
print('Median: {0:.4f}'.format(df.log_rtn.median()))
print('Min: {0:.4f}'.format(df.log_rtn.min()))
print('Max: {0:.4f}'.format(df.log_rtn.max()))
print('Standard Deviation: {0:.4f}'.format(df.log_rtn.std()))
print('Skewness: {0:.4f}'.format(df.log_rtn.skew()))
print('Kurtosis: {0:.4f}'.format(df.log_rtn.kurtosis()))  # pandas reports excess kurtosis, so a normal distribution gives 0
# run the Jarque-Bera test once and reuse both outputs
jb_stat, jb_p_val = scs.jarque_bera(df.log_rtn.values)
print('Jarque-Bera statistic: {stat:.2f} with p-value: {p_val:.2f}'.format(stat=jb_stat, p_val=jb_p_val))

Range of dates: 2000-01-04 - 2017-12-29
Number of observations: 4526
Mean: 0.0175
Median: 0.0000
Min: -16.9683
Max: 17.8773
Standard Deviation: 1.9341
Skewness: -0.1239
Kurtosis: 9.9657
Jarque-Bera statistic: 18694.53 with p-value: 0.00
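
For intuition on where the Jarque-Bera number comes from: the statistic combines sample skewness S and sample excess kurtosis K as JB = n/6 · (S² + K²/4), and under normality it is asymptotically chi-squared with 2 degrees of freedom. The sketch below rebuilds it by hand on a synthetic heavy-tailed sample (not the MSFT data) and compares the result with SciPy.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(3)
x = rng.standard_t(df=5, size=4000)   # synthetic heavy-tailed sample, not MSFT data

n = x.size
s = scs.skew(x)        # sample skewness
k = scs.kurtosis(x)    # sample excess kurtosis
jb_manual = n / 6 * (s ** 2 + k ** 2 / 4)
p_manual = scs.chi2.sf(jb_manual, df=2)   # asymptotic chi-squared(2) p-value

jb_scipy, p_scipy = scs.jarque_bera(x)
print(jb_manual, jb_scipy, p_manual)
```

With thousands of observations even moderate excess kurtosis produces a very large statistic, which is why the reported p-value rounds to 0.00.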

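2. No (or almost no) significant autocorrelation in returns

Autocorrelation measures how similar a time series is to a lagged version of itself over successive time intervals. It is analogous to the correlation between two time series: the first is the series in its original form, the second is the same series lagged by n periods.

Example: if an asset's returns have historically shown positive autocorrelation and the price has been rising over the past few days, one might reasonably expect further upward moves (although predicting stock prices is, of course, not that simple).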

# Autocorrelation plot of log returns ----
acf_r = smt.graphics.plot_acf(df.log_rtn, lags=40, alpha=0.05)  # alpha=0.05 gives the 95% confidence band
acf_r.show()
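
The definition above, correlation between a series and its own lagged copy, can be written out in a few lines. This is a minimal sketch on a synthetic AR(1) series; the process and its coefficient 0.6 are assumptions chosen so that the true autocorrelations are known, and the helper `autocorr` is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# synthetic AR(1) series x_t = 0.6 * x_{t-1} + noise, so the true lag-1 autocorrelation is 0.6
n, phi = 5000, 0.6
noise = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]
series = pd.Series(x)

def autocorr(s: pd.Series, lag: int) -> float:
    """Correlation between the series and a copy of itself shifted by `lag` periods."""
    return s.corr(s.shift(lag))

lag1 = autocorr(series, 1)
lag2 = autocorr(series, 2)
print(lag1, lag2)   # close to phi and phi ** 2 for this process
```

pandas offers the same computation as `Series.autocorr(lag)`; for returns that respect stylized fact 2, these values would sit near zero at every lag.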

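3. Slowly decaying autocorrelation in squared and absolute returns

When modeling returns, taking volatility into account can be crucial for the decision process (buy/sell). Volatility is usually understood as the standard deviation of returns (the square root of the variance).

For a moment, think of errors rather than returns, that is, the actual value minus the value predicted or explained by a model. Variance is essentially the mean of the squared errors, while the absolute deviation is the mean of the absolute errors. By plotting the squared or absolute errors over time, we can see whether the variance (or the absolute deviation, which is also a measure of volatility) stays constant. For asset returns it does not: we observe periods of high and low volatility. This is called "volatility clustering" and is visible in the time series plot of returns.

On the other hand, the average daily return, over long as well as short horizons, is expected to be zero (EMH). That is why, by looking at squared and absolute returns, we are effectively measuring deviations from the expected mean without regard to the direction of the error.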
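
To see how returns can be nearly uncorrelated while their squares and absolute values are not, it helps to simulate a process with volatility clustering. The sketch below uses a GARCH(1,1) recursion, which is my choice for illustration and not a model fitted in this article; the parameters `omega`, `alpha` and `beta` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# GARCH(1,1) simulation with hypothetical, illustrative parameters
omega, alpha, beta = 0.05, 0.10, 0.85
n = 5000
z = rng.normal(size=n)
sigma2 = np.empty(n)
rtn = np.empty(n)
sigma2[0] = omega / (1 - alpha - beta)   # start at the unconditional variance
rtn[0] = np.sqrt(sigma2[0]) * z[0]
for t in range(1, n):
    # today's variance reacts to yesterday's squared return and yesterday's variance
    sigma2[t] = omega + alpha * rtn[t - 1] ** 2 + beta * sigma2[t - 1]
    rtn[t] = np.sqrt(sigma2[t]) * z[t]

rtn = pd.Series(rtn)
acf_rtn = rtn.autocorr(lag=1)          # returns themselves
acf_sq = (rtn ** 2).autocorr(lag=1)    # squared returns
acf_abs = rtn.abs().autocorr(lag=1)    # absolute returns
print(acf_rtn, acf_sq, acf_abs)
```

The sign of each return is still a coin flip, so the first number stays near zero, while the size of the moves is persistent, so the other two are clearly positive.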

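Below are the autocorrelation plots of the MSFT returns, together with those of their squared and absolute values. The blue area marks the 95% confidence interval; points outside it are statistically significant. For the returns themselves only a few points are significant, which is consistent with fact 2. As for fact 3, the autocorrelations are significant, and their decay is easier to see for absolute returns. All in all, this leads us to believe that we can try to exploit the autocorrelation structure for volatility modeling.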

# specify the max amount of lags
lags = 40

fig, ax =plt.subplots(3, 1, figsize=(24, 20))
# returns ----
smt.graphics.plot_acf(df.log_rtn, lags=lags, alpha=0.05, ax=ax[0])  # alpha=0.05 gives the 95% band
ax[0].set_ylabel('Returns')
ax[0].set_title('Autocorrelation Plots')
# squared returns ----
smt.graphics.plot_acf(df.log_rtn ** 2, lags=lags, alpha=0.05, ax=ax[1])
ax[1].set_ylabel('Squared Returns')
ax[1].set_xlabel('')
ax[1].set_title('')
# absolute returns ----
smt.graphics.plot_acf(np.abs(df.log_rtn), lags=lags, alpha=0.05, ax=ax[2])
ax[2].set_ylabel('Absolute Returns')
ax[2].set_title('')
ax[2].set_xlabel('Lag')
fig.show()
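
When reading such plots, keep in mind that a few points outside the band are expected by chance alone. The sketch below illustrates this on synthetic white noise with a simplified constant band of ±1.96/√n; this is an assumption for illustration, and as far as I know the shaded region drawn by `plot_acf` is by default based on Bartlett's formula, which widens with the lag. The helper `acf` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(6)
n, lags = 4526, 40   # sample size mirrors the MSFT sample; the data itself is synthetic white noise
x = rng.normal(size=n)

def acf(values: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelations for lags 1..max_lag (full-sample mean and variance)."""
    d = values - values.mean()
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[:-k]) / denom for k in range(1, max_lag + 1)])

rho = acf(x, lags)
band = 1.96 / np.sqrt(n)                 # approximate 95% band under the white-noise null
n_significant = int(np.sum(np.abs(rho) > band))
print(band, n_significant)               # roughly 5% of lags are expected to exceed the band by chance
```

So "only a few significant points" out of 40 lags is what uncorrelated returns would look like, whereas the squared and absolute returns breach the band at almost every lag.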